Abstract

As a major source for information on virtually any topic, Wikipedia serves an important 
role in public dissemination and consumption of knowledge. As a result, it presents tremen- 
dous potential for people to promulgate their own points of view; such efforts may be more 
subtle than typical vandalism. In this paper, we introduce new behavioral metrics to quantify 
the level of controversy associated with a particular user: a Controversy Score (C-Score) based 
on the amount of attention the user focuses on controversial pages, and a Clustered Contro- 
versy Score (CC-Score) that also takes into account topical clustering. We show that both these 
measures are useful for identifying people who try to "push" their points of view, by showing 
that they are good predictors of which editors get blocked. The metrics can be used to triage 
potential POV pushers. We apply this idea to a dataset of users who requested promotion to 
administrator status and easily identify some editors who significantly changed their behav- 
ior upon becoming administrators. At the same time, such behavior is not rampant. Those 
who are promoted to administrator status tend to have more stable behavior than comparable 
groups of prolific editors. This suggests that the Adminship process works well, and that the 
Wikipedia community is not overwhelmed by users who become administrators to promote 
their own points of view. 



1 Introduction 

Wikipedia has become a one-stop source for information on nearly any subject. In aggregate, it 
has the power to broadly influence public perceptions. Wikipedia's ubiquity creates strong in- 
centives for biased editing, attracting editors with strong opinions on controversial topics. At the 
same time, Wikipedia is self-policing, and over time the Wikipedia community has formulated a 
comprehensive set of policies to prevent editors from "pushing" their own points of view on the 
readership ("POV pushing"). Most policing takes the form of users editing or reverting disputed 
content, while persistent violations are brought to the attention of administrators. A responding 
administrator has the power to temporarily protect pages from being edited, and to block users 
from editing. POV pushing is not considered vandalism on Wikipedia, the latter term being re- 
served for blatantly ill-intentioned edits. 



While there has been a lot of attention paid to the problems of vandalism and maintenance 
on Wikipedia, there has been little systematic, quantitative investigation of the phenomenon of 
POV pushing. Nevertheless, anecdotal evidence suggests that it is a serious issue. For example, 
in April 2008 a pro-Palestinian online publication called Electronic Intifada released messages 
from the pro-Israel media watchdog group CAMERA (the Committee for Accuracy in Middle 
East Reporting in America) that asked for volunteers to help "keep Israel related entries ... from 
becoming tainted by anti-Israel editors." The messages also contained blueprints explaining how 
members could become Wikipedia administrators, and then use their power to further the goals of 
the organization [1]. In 2006, there was a significant controversy surrounding edits to Wikipedia
pages of prominent U.S. politicians, made by their own staff.

To further illustrate the importance of the problem, in an interview with Alex Beam of the 
Boston Globe, Gilead Ini, who initiated the CAMERA campaign, said "[Wikipedia] may be the most
influential source of information in the world today, and we and many others think it is broken" 
(3. But another quote from Ini highlights the difficulty of confronting these issues, "Wikipedia 
is a madhouse. We were making a good-faith effort to ensure accuracy." Indeed, Wikipedia poli- 
cies typically assume good faith. The policy on vandalism states "Even if misguided, willfully 
against consensus, or disruptive, any good-faith effort to improve the encyclopedia is not vandalism.
Edit warring over content is not vandalism." In addition to the difficulty of arguing against
good faith when editors may simply be attempting to disseminate strongly held beliefs, even more 
subtle forms of manipulation can achieve a similar outcome. For example, a manipulative admin- 
istrator may enforce Wikipedia's Neutral Point Of View (NPOV) guidelines selectively, reverting 
only edits that take a particular point of view. Hypothetically, a manipulative admin with a conservative
viewpoint may revert all edits that seem to push a liberal viewpoint, while leaving those
that push a conservative viewpoint untouched (or vice versa for a manipulative liberal admin). 
While there is nothing technically "wrong" with this, it can significantly affect the information on 
pages. 

It is worth noting that even if there is attempted manipulation on sensitive pages, it is restricted 
to a relatively small fraction of Wikipedia. Most pages are more "encyclopaedic" in nature (for 
example, pages on mathematics or on algorithms), and generate less controversy than pages that 
deal with current events or ideologies. 

All of this cries out for a useful algorithmic method of detecting potential POV pushing behav- 
ior, and quantitative metrics that can provide evidence and allow us to examine the behavior of 
such users in more detail. In this paper we present two such metrics and then use them to examine 
the behavior of the population of administrators. 

The first metric, the Controversy Score ("C-Score"), measures the proportion of energy an ed- 
itor spends on controversial pages. It works by first assigning a controversy score to each page. 
This score is based on factors that have been identified as well-correlated with controversy, in- 
cluding the number of revisions to an article's talk page, the fraction of minor edits on the talk 
page, and the number of times it has been protected [8]. It is independent of language or content, 
and therefore easily generalizable. An editor's C-Score is then the mean of the controversy scores 
of the pages she edits, weighted by the proportion of her editing attention she focuses on those 
pages. 

1 http://en.wikipedia.org/wiki/Congressional_Staffer_Edits retrieved 11/4/2011.
2 http://en.wikipedia.org/wiki/Wikipedia:Vandalism retrieved 11/4/2011.

While the C-Score is a useful measure, it does not account for the topical clustering of a user's
edits. This is particularly important when we use these scores to assess the behavior of administrators,
because administrators' responsibilities imply that they will spend more time on controversial
pages in general. However, we would expect users who have strong opinions on a topic to
push their POV especially in pages related to that topic, rather than broadly across many different 
controversial topics. Consider two editors A and B, with 100 edits each. They each have 25 edits 
on the article about the U.S. Republican party. Editor A's remaining edits are about Republican 
legislators and Republican sponsored legislation from the past 10 years, while B's are divided be- 
tween the IRA, the Catholic Church, and Jimmy Wales. All of B's edits are controversial, but only 
some of A's edits are. While B has more controversial edits, we would intuitively consider A to be 
more suspicious. 

To deal with editors of this form, we introduce the Clustered Controversy Score ("CC-Score"), 
which takes into account the similarity among different pages that a user has edited, in addition 
to how controversial those pages are. We expect the CC-Score to be particularly useful for triage, 
as it is designed to be a high recall measure: it flags potentially manipulative users, who focus 
their attention on specific topic areas that include controversial topics. Of course, some users who 
have editing patterns of this form may be acting in good faith and just have deep interests in that 
topic. 

Both the C-Score and the CC-Score are behavioral. They do not rely in any way on the specific
text of edits, only on the patterns of editing and interaction between editors. We demonstrate the 
validity of the two scores by showing that they have predictive power in discriminating between 
heavy editors who were blocked and equally heavy editors who were not. Having validated 
them on an exogenous measure, we then apply these measures in order to analyze the behavior 
of administrators. We compare the editing patterns of administrators who score highly on the 
C-Score and the CC-Score. The CC-Score identifies administrators who would not have been 
identified by the simple C-Score, because only some of the major pages on the topics they edit are 
highly controversial. These administrators also edit a long tail of related pages, thus influencing 
public perception of the topic at large. 

Finally, we use the CC-Score to test whether or not admins are behaving in a truly manipulative 
manner: first becoming trusted and attaining promotion to admin status, and then using this 
trusted status to push their points of view in particular domains. To do so, we look at changes in 
CC-Score before and after the editor stood for promotion (the Request for Adminship, or "RfA"
process). While we find several instances of potentially suspicious changes of focus, we also note 
that the overall behavior of the population of editors who become admins is better than that of 
two comparable populations: (1) those who stood for election but failed; and (2) those who edited 
prolifically but never stood for election to administrator status. The population behavior is better 
in the sense that the variance of changes in the CC-Score pre- and post-RfA is lower, indicating that
those who fail in their RfAs actually change their behavior more significantly. Thus, Wikipedia
admins as a population do not misrepresent themselves in order to gain their trusted status. 

1.1 Contributions 

We introduce two new behavioral measures that indicate whether or not a user is trying to push 
his or her point of view on Wikipedia pages. These measures have predictive power on historical 
data: they can determine which users were blocked for disputes related to controversial editing. 
We anticipate that these measures can be used for auditing or triage: they can flag potentially 
suspicious behavior automatically for more detailed human investigation. The measures are behavioral
and general, and do not rely on specifics of text edited by users, and are thus applicable
beyond Wikipedia. 

We then show how these measures can be used to discover interesting changes in behavior, focusing
on editors who applied for promotion to administrator status on Wikipedia. While there are
specific instances of suspicious-looking changes in behavior upon promotion, we find that at the
population level Wikipedia admins are in fact better behaved than the population of prolific
editors, in the sense that their behavior is more stable and does not change significantly upon
promotion. Our evidence suggests that the Wikipedia adminship process works well at the population
level: there is no evidence that editors are in general seeking promotion to adminship so they can
"push" their point of view on the larger population.

2 Related work 

There is a large literature on many different aspects of Wikipedia as a collaborative community. It 
is now well-established that Wikipedia articles are high quality [5] and very popular on the Web 
[12]. The dynamics of how articles become high quality and how information grows in collective
media like Wikipedia have also garnered some attention ffT4"ll4fl. While there has not been much 
work on how Wikipedia itself influences public opinion on particular topics, it is not hard to draw 
the analogy with search engines like Google, which have the power to direct a huge portion of the 
focus of public attention to specific pages. Hindman et al discuss how this can lead to a few highly 
ranked sites coming to dominate political discussion on the Web [6]. Subsequent research shows
that the combination of what users search for and what Google directs them to may lead to more 
of a "Googlocracy" than the "Googlearchy" of Hindman et al [10].

Our work in this paper draws directly on three major streams of literature related to Wikipedia:
work on conflict and controversy, automatic vandalism detection, and the process of
promotion to adminship status on Wikipedia.

There is a significant body of work characterizing conflict on Wikipedia. Kittur et al introduce
new tools for studying conflict and coordination costs in Wikipedia [8]. Vuong et al characterize
controversial pages using both disputes on a page and the relationships between articles and
contributors [13]. We use the measures identified by Kittur et al and Vuong et al as a starting point
for measuring the controversy level associated with a page. This then feeds into our user-level 
C-Score and CC-Score measures. Our results on the blocked users dataset serve as corroborating 
evidence for the usefulness of these previously identified measures. 

Automatic vandalism detection has been a topic of interest from both the engineering perspec- 
tive (many bots on Wikipedia automatically find and revert vandalism), as well as from a scientific 
perspective. Smets et al report that existing bots, while useful, are "far from optimal", and report 
on the results of a machine learning approach for attempting to identify vandalism [11]. They
conclude that this is a very difficult problem to solve without incorporating semantic information. 
While we touch on vandalism in dealing with blocked users, we are focused on "POV pushing" 
by extremely active users who are unlikely to engage in petty vandalism, which is the focus of 
most work on automated vandalism detection. 

Wikipedia administrator selection is an independently interesting social process. Burke and 
Kraut study this process in detail and build a model for which candidates will be successful once 
they choose to stand for promotion and go through the Request for Adminship (RfA) process
[3]. The dataset of users who stand for promotion is useful because it allows us to compare both
previous and later behavior of users who were successful and became admins and those who did 
not. 

Finally, we use a similarity metric for articles based on editors which is similar to existing work 
on expert-based similarity 13. 

3 Methodology 

We begin by discussing our methodology in computing the "simple" Controversy Score for each 
user, and then describe how we can compute a Clustered Controversy Score that captures editors 
who focus on articles related to a single, controversial topic. 

All data is from an April 2011 database dump of the English Wikipedia. The term "article" 
refers to a page in Wikipedia's article namespace along with any pages in the article talk names- 
pace with the same name, unless otherwise specified. 

3.1 Controversy Score 

We define the C-Score for a user as an edit-proportion-weighted average of the level of controversy 
of each page. The controversy of a page (loosely following the article-level conflict model of Kittur 
et al [8]) is based on the number of revisions to an article's talk page, the fraction of minor edits
on an article's talk page, and the number of times a page is "protected", where editing by new or 
anonymous users is limited. 

We scale and shift each of the three quantities above so that all three share the same 5th and 95th
percentiles, then take the mean. Next, we transform this number such that the lowest values are
at -5 and 1% of articles have scores above 0. Finally, the scores are transformed using the logistic
function $1/(1 + e^{-t})$. This produces a controversy score $c_k \in [0, 1]$ for each page. One alternative to
manual tuning is logistic regression, where a model is trained on a data set reflecting some notion
of controversy.
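
The page-level scoring steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not the exact procedure used in the paper: the function names are hypothetical, the three features are assumed to enter with a positive sign, "sharing the same percentiles" is interpreted as mapping each feature's 5th percentile to 0 and its 95th to 1, and "1% of articles above 0" is implemented by sending the 99th percentile of the averaged score to 0.

```python
import math

def percentile(values, q):
    """Linear-interpolation percentile (q in [0, 100]) of a non-empty list."""
    s = sorted(values)
    pos = (len(s) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)

def rescale(values, q_lo=5, q_hi=95):
    """Affinely map values so the q_lo-th percentile is 0 and the q_hi-th is 1."""
    a, b = percentile(values, q_lo), percentile(values, q_hi)
    span = (b - a) or 1.0  # assumption: a constant feature is left at 0
    return [(v - a) / span for v in values]

def page_controversy(talk_revisions, talk_minor_fraction, protections):
    """Return one controversy score in [0, 1] per page (hypothetical helper)."""
    feats = [rescale(talk_revisions),
             rescale(talk_minor_fraction),
             rescale(protections)]
    # Mean of the three rescaled features for each page.
    raw = [sum(col) / 3 for col in zip(*feats)]
    # Affine map: lowest value -> -5, 99th percentile -> 0,
    # so roughly 1% of pages end up with a positive value.
    lo, cut = min(raw), percentile(raw, 99)
    span = (cut - lo) or 1.0
    t = [-5.0 + 5.0 * (r - lo) / span for r in raw]
    # Logistic squashing into [0, 1].
    return [1.0 / (1.0 + math.exp(-x)) for x in t]
```

With this mapping, only pages in roughly the top 1% receive a score above 0.5, while the least controversial page receives $1/(1 + e^{5})$, which is close to 0.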

Let $p_k$ be the fraction of a user's edits on page k. The controversy score for a user is then an
edit-weighted average of the page-level controversy scores:

$$\text{C-Score} = \sum_{k} p_k c_k \qquad (1)$$
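As a small illustration of this edit-weighted average, the following sketch computes a user's C-Score from a list of edited page names and a table of page-level controversy scores. The function name, the input layout, and the convention that an empty edit history scores 0 are assumptions made for the example.

```python
from collections import Counter

def c_score(edited_pages, controversy):
    """Edit-weighted mean of page controversy scores.

    edited_pages: one page name per edit (repeats allowed).
    controversy:  dict mapping page name -> score in [0, 1].
    """
    counts = Counter(edited_pages)
    total = sum(counts.values())
    if total == 0:
        return 0.0  # assumption: a user with no edits scores 0
    # The weight of each page is the fraction of the user's edits on it.
    return sum((n / total) * controversy[page] for page, n in counts.items())

# Toy data: three edits on a controversial page, one on a quiet page.
example = c_score(["A", "A", "A", "B"], {"A": 0.8, "B": 0.2})
```

Here the toy user spends 75% of their edits on page A and 25% on page B, so the score is 0.75 * 0.8 + 0.25 * 0.2.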

We would expect this measure to be effective at finding users who edit controversial pages. 
However, many Wikipedia users dedicate at least part of their time to removing blatant vandalism, 
which occurs disproportionately on controversial pages. Thus we turn to a measure that combines
topical clustering with controversy. 

3.2 Clustered Controversy Score 

We work from the hypothesis that users who concentrate their edits have some vested interest in 
those articles. Going back to the example in the Introduction, we would like to be able to detect 
users like A, who focuses almost entirely on Republican politics. While A's edits include some 
controversial pages, B fights vandalism broadly, and so has exclusively controversial edits. B has 
the same number of edits to the article on the U.S. Republican Party as A, but the rest of B's edits
are scattered across other topics. A's edits to this article are interesting; they are topically related 
to A's other edits. At the same time, edits to this article by editor B are far less interesting. 

We would like to incorporate a measure of topical edit concentration into the C-Score. In order 
to do so, we could define topics globally, but this is both expensive and sensitive to parameter 
changes: what is the correct granularity for a topic? Instead, we focus on a local measure of 
topical concentration. Given a similarity metric between articles, we can measure the extent to 
which a user's edits are clustered. 



Page similarity. We base our score on a generalization of the clustering coefficient to weighted
networks with edge weights between 0 and 1 [7]. Several natural measures of page similarity have
values in this range.

We consider pages which link to (or are linked from) the same pages as similar, pages edited 
by the same users as similar, and pages in the same categories as similar. Each page has a set of 
incoming links I, outgoing links O, users U, and categories T associated with it. To determine
how similar two pages are based on one of these sets, we divide the cardinality of the intersection 
of the sets from each page by the cardinality of their union (the Jaccard coefficient). To compute 
a single similarity score between two pages i and j, we take an average of scores for each type of 
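set, giving equal weight to links, users, and categories. The similarity score $w_{ij}$ is then:

$$w_{ij} = \frac{|I_i \cap I_j|}{6\,|I_i \cup I_j|} + \frac{|O_i \cap O_j|}{6\,|O_i \cup O_j|} + \frac{|U_i \cap U_j|}{3\,|U_i \cup U_j|} + \frac{|T_i \cap T_j|}{3\,|T_i \cup T_j|} \qquad (2)$$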
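
A minimal sketch of this similarity computation follows. The `Page` container and function names are hypothetical, and treating the Jaccard coefficient of two empty sets as 0 is an assumption, since the text does not say how that case is handled.

```python
from dataclasses import dataclass, field

@dataclass
class Page:
    """Hypothetical container for the four sets attached to a page."""
    inlinks: set = field(default_factory=set)
    outlinks: set = field(default_factory=set)
    editors: set = field(default_factory=set)
    categories: set = field(default_factory=set)

def jaccard(a, b):
    """|a & b| / |a | b|; defined as 0 when both sets are empty (assumption)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def similarity(p, q):
    """Equal weight (1/3 each) to links, users, and categories.

    The link weight is split evenly between incoming and outgoing links.
    """
    return (jaccard(p.inlinks, q.inlinks) / 6
            + jaccard(p.outlinks, q.outlinks) / 6
            + jaccard(p.editors, q.editors) / 3
            + jaccard(p.categories, q.categories) / 3)

# Toy pages with made-up link, editor, and category sets.
p = Page({"a", "b"}, {"x"}, {"u1", "u2"}, {"c"})
q = Page({"b", "c"}, {"y"}, {"u2"}, {"c"})
w_pq = similarity(p, q)
```

For the toy pages, the four Jaccard coefficients are 1/3, 0, 1/2, and 1, so the combined similarity is 1/18 + 0 + 1/6 + 1/3.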

Computing the CC-Score. Consider a set of edits from a user. Let N be the number of unique
pages in this set and $w_{ij}$ be the similarity score between pages i and j. We start with a generalization
of the clustering coefficient [7]. For a page k, define:

$$\mathrm{clust}(k) = \frac{\sum_{i=1}^{N} \sum_{j=1}^{N} w_{ki} w_{kj} w_{ij}}{\sum_{i=1}^{N} \sum_{j=1}^{N} w_{ki} w_{kj}} \qquad (3)$$

The clustering coefficient will be higher when other pages in the edit set are related to k and to 
each other. When computing the CC-Score for the entire edit set, there are two other factors we 
would like to consider: how much a user concentrated on any given page, and how controversial 
that page is. Let $p_k$ be the proportion of edits on page k, and $c_k$ be some measure of controversy.
Then we have the following coefficient for the edit set: 

$$\text{CC-Score} = \sum_{k=1}^{N} p_k \, c_k \, \mathrm{clust}(k) \qquad (4)$$

Since $\sum_{k=1}^{N} p_k = 1$, (4) is a weighted average. If $c_k \in [0, 1]$, then so is (4). Pushing raw controversy
scores through a sigmoid to produce $c_k$ ensures that this condition holds, and also prevents
outliers from unduly affecting the final score.
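
The following sketch shows one way the clustering coefficient and the CC-Score could be computed for a single user's edit set. The function names and the dictionary-based inputs are illustrative assumptions. The sketch also assumes the usual convention for weighted clustering coefficients of skipping the terms with i = j, i = k, or j = k, and returns 0 when the denominator vanishes.

```python
def clust(k, pages, w):
    """Weighted clustering coefficient of page k within an edit set.

    w(i, j) returns the similarity in [0, 1] between pages i and j.
    """
    others = [p for p in pages if p != k]
    num = den = 0.0
    for i in others:
        for j in others:
            if i == j:
                continue  # assumption: only distinct pairs of neighbours
            num += w(k, i) * w(k, j) * w(i, j)
            den += w(k, i) * w(k, j)
    return num / den if den else 0.0

def cc_score(p, c, w):
    """Sum over pages of p_k * c_k * clust(k).

    p: dict page -> fraction of the user's edits (values sum to 1).
    c: dict page -> page-level score in [0, 1].
    """
    pages = list(p)
    return sum(p[k] * c[k] * clust(k, pages, w) for k in pages)

# Toy example with a symmetric, made-up similarity table.
table = {("a", "b"): 1.0, ("a", "c"): 0.5, ("b", "c"): 0.5}
w = lambda i, j: table.get((i, j), table.get((j, i), 0.0))
p = {"a": 0.5, "b": 0.25, "c": 0.25}
c = {"a": 1.0, "b": 0.4, "c": 0.0}
cc = cc_score(p, c, w)
# Setting every c_k to the same value gives a score that depends only on clustering.
raw_clustering = cc_score(p, {k: 1.0 for k in p}, w)
```

Because each similarity lies in [0, 1], every `clust(k)` does too, so the result stays in [0, 1] whenever the page-level scores do.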

There is no reason that $c_k$ must be a measure of controversy. Instead, it can measure any property
of a page which is of interest. For example, a $c_k$ measuring how much a page relates to global
warming would yield a ranking of editors based on the extent to which their edits concentrate
on global warming. The CC-Score is a general tool for ranking single-topic contributors based
on some property of that topic. We also compute a raw Clustering Score where each page has the
same $c_k$; this yields a measure of topical clustering independent of any properties of the particular
pages.

We choose a measure that combines clustering and controversy page-wise rather than user- 
wise so that we do not end up with editors who are very topically focused on uncontroversial 
pages (say Flamingos), but also spend a significant fraction of their time combating vandalism 
broadly across a spectrum of topics. We also note that the only Wikipedia-specific contributions
to the CC-Score are encapsulated in the computation of $c_k$ and $w_{ij}$. The same quantities can be
computed for a wide variety of collaborative networks. Consider email messages: $w_{ij}$ between
two threads could be based on senders and recipients, and $c_k$ based on the length of the thread as
a measure of controversy. These quantities are entirely language independent, although we might 
make use of natural language processing to improve estimates of both similarity and controversy. 

4 Evaluation 

We evaluate our metrics in several different ways. First, to establish their validity, we examine 
whether the metrics provide discriminatory power in identifying potentially manipulative users. 
In order to do so, we need an independent measure of manipulation, so we focus on users that 
were blocked from editing on Wikipedia, and compare them with a similar set who were not 
blocked. One of the goals of our work is to provide an objective metric for analyzing adminis- 
trators, who have gained significant status in Wikipedia. We present some detail on the editing 
habits of the admins who score highest on our metrics. In doing so, we also use our metrics to 
provide fresh insight into what is controversial on Wikipedia, by analyzing the topic distribution 
of edits amongst admins with high CC-Scores. 

A reasonable hypothesis, suggested by the CAMERA messages discussed in Section 1, is that
people who wish to seriously push their points of view on Wikipedia may try to become admins 
by editing innocuously, and then changing their behavior once they become admins. In order to 
examine this hypothesis, we look at the behavior of admins whose CC-Scores changed signifi- 
cantly, as well as at the distribution of changes in the CC-Score. 

4.1 Blocked users 

Users can be blocked from Wikipedia for a variety of reasons. Reasons for blocks include blatant 
vandalism (erasing the content of a page), editorial disputes (repeatedly reverting another user's 
edits), threats, and more. Many blocks are of new or anonymous editors for blatant vandalism; 
we are not interested in these blocks. 

We are interested in blocks stemming from content disputes. While editors are not directly 
blocked for contributing to controversial articles, controversy on Wikipedia is often accompanied 
by "edit warring", where two or more editors with mutually exclusive goals repeatedly make 
changes to a page (e.g., one editor thinks the article on Sean Hannity should be low priority for 
WikiProject Conservatism, and another thinks it should be high priority). 

We examine a set of users who were active between January 2010 and April 2011. For blocked
users, we use 180 days of data, directly before the block. For the users who were never blocked,
the 180-day window ends on one of their edits chosen at random. In order to filter out new or infrequent
editors, we only consider users with between 500 and 1000 edits during this 180-day period.
The upper bound removes users who do significant amounts of automated editing; it is not uncommon
for such accounts to be blocked for misbehaving scripts which have nothing to do with
controversy. By examining only exceptionally active users, we eliminate most petty reasons for
blocks; users who have made hundreds of legitimate contributions are unlikely to start blanking
pages. After the filtering, we are left with 178 blocked users and 236 who were never blocked.

[Figure 1 plot: "Block ROC"; horizontal axis: FPR]

Figure 1: ROC curve for CC, Controversy, and Clustering Scores when differentiating between
blocked and not-blocked users, based on 180 days of data. The CC and Controversy Scores effectively
discriminate between these classes, whereas the Clustering Score does not.

Figure 1 shows the performance of the CC, Controversy, and Clustering Scores when discriminating
between the blocked users and users who were never blocked. Both the CC- and C-Scores
show significant discriminative power, while Clustering alone is no better than guessing. 
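
For readers who want to reproduce this kind of comparison, the sketch below builds an ROC curve by sweeping a threshold over per-user scores and summarizes it with the area under the curve. The function names are hypothetical and the scores and labels are made-up toy values, not results from the paper. An area near 0.5 corresponds to guessing, and an area near 1 to strong discrimination.

```python
def roc_points(scores, blocked):
    """Return (FPR, TPR) points from sweeping a threshold over the scores.

    scores:  one score per user (higher means more suspicious).
    blocked: parallel list of booleans, True if the user was blocked.
    """
    pos = sum(1 for b in blocked if b)
    neg = len(blocked) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one blocked and one unblocked user")
    ranked = sorted(zip(scores, blocked), key=lambda sb: -sb[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        s = ranked[i][0]
        # Users with tied scores cross the threshold together.
        while i < len(ranked) and ranked[i][0] == s:
            if ranked[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under a piecewise-linear ROC curve (trapezoidal rule)."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Toy data only: four users, two of them blocked.
toy_points = roc_points([0.9, 0.8, 0.7, 0.6], [True, False, True, False])
toy_auc = auc(toy_points)
```

In the toy example, three of the four (blocked, unblocked) pairs are ranked correctly, which is what an area of 0.75 expresses.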

The performance of the CC- and C-Scores on the blocked users data set validates both measures
for detecting users who make controversial contributions to Wikipedia. Many blocks in this
data set involve violations of Wikipedia's "3 Revert Rule", which limits the number of contributions
an editor can revert on a single page during any 24-hour period; such violations imply that editors
are not only making controversial changes but are vigorously defending them. This rule is not
automatically enforced and does not apply to blatant vandalism; instead, another user must post 
a complaint which is then reviewed by an administrator. It is certainly possible to edit controver- 
sial pages while following the rules closely and never getting blocked, and likewise some blocks 
of active users have nothing to do with controversy (rogue scripts, for example). All else being 
equal, we expect more controversial editors to be blocked more frequently. The discriminative 
power of the CC- and C-Scores provides strong evidence that these scores are correctly detecting 
controversial editors.

4.2 Highest scoring admins 

We now turn to examining the behavior of admins through the lens of the Controversy and CC- 
Scores. Where do admins focus their attention, and what is controversial on Wikipedia? To explore 
this issue, we use human evaluations of the edit history of administrators during the 180 days 
after they became an administrator. Without knowing anything about the CC, Controversy, or 
Clustering Scores associated with the edit history, a reviewer analyzed the top 50 pages edited by 
a user and decided which general category, if any, the edits were in. Results for the top and bottom 
100 administrators ranked by each score are presented in Figure 2.

Figure 2 is useful validation for our methodology: for example, administrators with very low
CC-Scores were often classified as not editing in a coherent general category. Conversely, users 
with a high CC-Score were much more likely to be topically focused. This effect is even more 
apparent for the raw Clustering Score, with 73% of the bottom 100 admins classified as topically 
unfocused. This implies that our article similarity metric corresponds with an intuitive notion of 
topical similarity. 

There are some interesting differences in the topic areas of the 100 admins with the highest 
C-Scores and the 100 with the highest CC-Scores. As expected, the C-Score picks up a substantial 
chunk of editors with no particular focus, while the CC-Score does not. These are admins who
are doing their job of "policing" controversy across a broad spectrum of topics. Surprisingly, the 
C-Score picks up more politically focused editors while the CC-Score picks up many more who 
focus on media and entertainment. 

Specific examples help to elucidate this effect. Table 1 shows the top 6 administrators by CC
and C-Score respectively for the 180 day period immediately following their promotion (success- 
ful RfA). The CC-Score picks up three media-focused editors while the C-Score does not. These 
editors focus on a specific media franchise (a TV show, for example), editing on this topic almost 
exclusively. The long tail of edits for media-focused users often includes non-controversial articles 
related to the same media franchise, for example articles on minor characters. This comprehensive 
long tail means that the page-level clustering score for pages related to the media franchise is very 
high, including the main pages related to the franchise; these main pages tend to be quite con- 
troversial. This combination of controversy and clustering contributes significantly to the user's 
overall CC-Score, while the long tail of less controversial articles moderates the C-Score. 

We note that all the users with very high C-Scores also have very high CC-Scores (falling at 
least in the 96th and typically in the 99th percentile of CC-Score). However, the converse is not 
true, because of the properties of the CC-Score described above. For example, Admins 3, 4, and
5 in particular are relatively low in C-Score. Admin 5 is media-focused. Admin 4 has a high
CC-Score despite editing topically unrelated pages, because the similarity scores between them are
high. This user focuses on disambiguation pages, all of which share common categories, and
many of which are maintained by the same users. Major disambiguation pages also turn out 
to be fairly controversial, being of interest to editors contributing to any of the pages they link 
to. This combination of a long tail of "related" disambiguation pages with controversial major 
disambiguation pages leads to a high CC-Score, just as it did for media. As with media, the CC-Score
highlights a real phenomenon on Wikipedia: WikiProject Disambiguation, of which this user
is a member, consists of users who focus their efforts on disambiguation pages. Thus the CC-Score
is finding exactly single-topic editors of controversial pages, even if their topic is rather specific to
Wikipedia.

Figure 2: Human evaluation of the general category of edits (if any) for administrators directly
after their RfA. The 100 highest and 100 lowest scoring administrators for each metric are shown;
because of overlap, 286 administrators are represented in total. Many users with a high CC-Score
are topically focused, and the CC-Score finds many more media-focused users than the C-Score.

The story with Admin 3 is similar to Admins 4 and 5, in that this editor also edits a long tail of 
uncontroversial articles. Qualitatively it is useful that the CC-Score is picking up different people 
than the simple C-Score, and sometimes turning up surprises; this is exactly the kind of behavior
one would want to pick up on, and it may give the CC-Score an advantage over the simple C-Score 
in terms of detecting subtle manipulation. 

We note that in this paper we are agnostic to what makes a page or a topic controversial, which
reveals much of interest about Wikipedia. At the same time, our methods are completely general:
we expect they would work well with any measure of controversy, and so the
techniques can easily be adapted to domain-specific needs. For example, to focus on more traditionally
controversial single-topic editors, we might consider a modification of the page-level controversy
score which ignores controversy on media-related pages. More generally, the page-level
controversy score can give preference to any topic; it does not need to be related to controversy at 
all. For example, we could find single-topic editors interested in a specific country. 

4.3 Distribution of CC changes 

While it is interesting to find editors focused on a single, controversial topic, it is not surprising
that such editors exist; Wikipedia certainly needs domain experts even on controversial topics.
Sudden changes in behavior, especially increases in the topical concentration or the controversial
nature of edits, are more surprising, particularly when some level of community trust is involved,
as with administrators. In particular, an editor changing behavior dramatically shortly after
becoming an admin is suspicious.

The RfA process. Standing for promotion to adminship on Wikipedia is an involved process.*
An editor who stands for, or is nominated for, adminship must undergo a week of public scrutiny,
which allows the community to build consensus about whether or not the candidate should be
promoted. A special page is set up on which the candidate makes a nomination statement about
why she or he should be promoted, based on detailed evidence from their history of contributions
to Wikipedia. Other users can then weigh in and comment on the case, and typically a large
volume of support (above 75% of commenters) as well as solid supporting statements from other
editors are necessary for high-level Wikipedia "bureaucrats" to approve the application. Burke
and Kraut provide many further details on this process [3]. Wikipedia policies call for nominees
to demonstrate a strong edit history, varied experience, adherence to Wikipedia policies on points
of view and consensus, and a willingness to help with tasks that admins
are expected to do, like building consensus. Burke and Kraut note that the actual value of some
of these may be mixed: participating in seemingly controversial tasks like fighting vandalism or
requesting admin intervention on a page before becoming an admin actually seems to hurt the
chances of success.

Overall, the Wikipedia community devotes significant effort to the RfA process, and
considerable human attention is focused on making sure that those who become admins are worthy of
the community's trust. Now we turn to examining some cases where the behavior of an editor
changed significantly right after they became an admin.

* http://en.wikipedia.org/wiki/Wikipedia:Rfa

Table 1: The most edited articles by the administrators with the highest CC-Scores (top 6) and
highest C-Scores (bottom 6) during the 180 days after they became an admin. Each article is annotated
with the percentile of its article-level controversy score and the percentage of the administrator's
edits which were to that article. On top of each table are the percentiles for the CC, Clustering,
and C-Scores of the administrator during the same period.

Table 2: Most edited articles for 180 days before and after becoming an administrator. Users were
selected from the top ten CC-Score changes.

Changes in behavior. Table 2 shows the article edit history of four administrators for 180 days
before and 180 days after their successful RfA. These users were among the top 10 administrators
ranked by the change in CC-Score between the two periods (Admins 1, 2, 3, and 4 were ranked 1,
3, 5, and 9 respectively in CC-Score change). For two of the administrators shown (Admin 1 and
Admin 4), the change in CC-Score is explained by a change in edit distribution; while the general
category of their edits remains consistent, they concentrated on this category much more heavily
after their respective RfAs.
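
A before/after comparison of this kind can be sketched as follows. The sketch assumes that each
editor's history is a list of (timestamp, page) pairs and that `score_fn` maps a list of edited
pages to a user-level score. The function `top_page_share` is only a toy stand-in for the CC-Score,
and all names and data here are illustrative.

```python
from datetime import datetime, timedelta


def split_windows(edits, pivot, days=180):
    """Split (timestamp, page) edits into the windows before and after pivot."""
    span = timedelta(days=days)
    before = [page for ts, page in edits if pivot - span <= ts < pivot]
    after = [page for ts, page in edits if pivot <= ts < pivot + span]
    return before, after


def score_change(edits, pivot, score_fn, days=180):
    """Score in the window after the pivot minus score in the window before it."""
    before, after = split_windows(edits, pivot, days)
    return score_fn(after) - score_fn(before)


def rank_by_change(histories, pivots, score_fn, days=180):
    """Rank users by score change, largest increase first."""
    changes = {
        user: score_change(edits, pivots[user], score_fn, days)
        for user, edits in histories.items()
    }
    return sorted(changes.items(), key=lambda item: item[1], reverse=True)


def top_page_share(pages):
    """Toy stand-in score: share of edits going to the most-edited page."""
    if not pages:
        return 0.0
    return max(pages.count(p) for p in set(pages)) / len(pages)


# Toy histories; the pivot plays the role of the date of a successful RfA.
pivot = datetime(2008, 1, 1)
histories = {
    "A": [(datetime(2007, 10, 1), "X"), (datetime(2007, 11, 1), "Y"),
          (datetime(2008, 2, 1), "Z"), (datetime(2008, 3, 1), "Z")],
    "B": [(datetime(2007, 12, 1), "X"), (datetime(2008, 1, 15), "X")],
}
pivots = {"A": pivot, "B": pivot}
ranking = rank_by_change(histories, pivots, top_page_share)
```

In this toy example editor A, who moves from two pages to a single page, is ranked above editor B,
whose concentration does not change.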

For Admins 2 and 3 in Table 2, the change in CC-Score is the result of a rather dramatic shift in
topic. Admin 2 shifts from mathematics to 9/11 conspiracy theories (several related pages are not
shown in the table), while Admin 3 shifts from relatively unfocused edits to the Sri Lankan Civil
War. Upon further examination of their behavior, neither administrator appears to be violating
Wikipedia policy (both instead act as mediators and enforce a neutral point of view), and yet the
changes are quite striking. While it may not be the case for these editors, a similar pattern could
reflect subtle manipulation, for example through one-sided enforcement of the NPOV guidelines.

Figure 3: Distribution of changes in the CC-Score before and after successful and unsuccessful
RfAs, and for users who have never participated in an RfA. Panel (a) shows the full distribution;
panel (b) shows a log-log plot of the right-hand tail.



Population-level changes. This leads us to a more general question: is there evidence of a
population-wide change in admin behavior after successful RfAs? Figure 3 shows the distribution
of CC-Score changes for RfA candidates, both successful and unsuccessful, before and
after their respective RfAs (again 180 days each), and for a group of 1000 active editors who were
never nominated for administrator status. The results clearly show that those who stand for
promotion and are successful behave differently at the population level than those who either stand
for promotion and fail or never stand for promotion at all. In fact, they end up staying
closer to their previous behavior than either of the other groups - the variance in CC-Score
changes is higher for the other two groups than it is for editors who had successful RfAs (see
below for details on the data and statistical tests). This strongly implies that there is no serious
problem with people becoming admins on Wikipedia in order to push their own point of view.
There are two reasonable hypotheses that may explain the lower variance in CC-Score changes for
successful RfA candidates. Either some aspect of the RfA process selects for editors who are less
likely to change their behavior, or the very fact of becoming an administrator has a "centralizing"
influence: given their new status, associated with a (real or perceived) higher level of scrutiny,
administrators become less likely to change their behavior.

The hypothesis that the RfA process selects for editors who tend not to change their CC-Scores
is unlikely, as we would then expect this type of user to appear in the population of users who were
never nominated to become administrators. If this were the case, then the non-RfA distribution
would be a mix of the successful and unsuccessful RfA distributions; instead, the unsuccessful
and non-RfA distributions are similar to each other and different from successful RfAs. We run a
second test that provides further evidence for the centralizing hypothesis. We construct a matched
sample of successful and unsuccessful RfA candidates, matching on the estimated probability p_i that
editor i's RfA will be successful based on i's pre-RfA behavior. We use the model of Burke
and Kraut to estimate the p_i [3]. For each successful RfA candidate j, we find the editor k in the
unsuccessful RfA set with p_k closest to p_j (throwing out examples that are not within 1 percentage
point). We then compare the population-level behavior of the two sets of editors we are left with.
Now that we have controlled for endogenous factors, we expect that the two populations are very similar
in intrinsic qualities: the only difference between them should be that the successful ones actually
became admins and the unsuccessful ones did not. We again find that the population of successful
admins is significantly different, exhibiting more stable behavior than the population of editors
who were unsuccessful in their RfAs (see below for details on statistics). This suggests that some
aspect of actually being an administrator reduces the propensity for significant behavior changes
(incidentally, this makes administrators who do significantly change their editing behavior all the
more interesting).
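
The matching step can be sketched as nearest-neighbor matching with a caliper of 1 percentage
point. The sketch assumes the estimated success probabilities are already available as
dictionaries keyed by editor, and that an unsuccessful candidate may be reused as a match for
several successful ones (matching with replacement); the text does not specify either detail, and
the function name and toy values are hypothetical.

```python
def match_on_probability(successful, unsuccessful, caliper=0.01):
    """Pair each successful candidate with the nearest unsuccessful one.

    successful, unsuccessful: dicts mapping editor id -> estimated probability
    that the editor's RfA succeeds. Pairs whose probabilities differ by more
    than `caliper` are discarded. Matching is with replacement (an assumption).
    """
    pairs = []
    if not unsuccessful:
        return pairs
    for j, p_j in successful.items():
        # Nearest unsuccessful candidate in estimated probability.
        k = min(unsuccessful, key=lambda e: abs(unsuccessful[e] - p_j))
        if abs(unsuccessful[k] - p_j) <= caliper:
            pairs.append((j, k))
    return pairs


# Toy probabilities, made up for illustration.
successful = {"s1": 0.80, "s2": 0.55, "s3": 0.95}
unsuccessful = {"u1": 0.795, "u2": 0.555, "u3": 0.30}
pairs = match_on_probability(successful, unsuccessful)
```

Here `s3` has no unsuccessful counterpart within the caliper and is dropped, so the comparison is
restricted to candidates whose pre-RfA profiles look alike under the model.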

Data and statistics: For the non-RfA users, we use edits before and after a randomly selected
edit. The distributions of changes for unsuccessful and non-RfA editors have a significantly higher
variance than the distribution for successful RfAs, with the 95% confidence interval on the ratio
of the variance of the successful RfA distribution to the variance among non-RfA users being
[0.22, 0.28]. The 95% confidence interval on the same ratio for unsuccessful and non-RfA users,
on the other hand, is [0.88, 1.11] (est. 0.99). Further, the Kolmogorov-Smirnov test rules out an
identical distribution for the successful and unsuccessful distributions (p = 10^-7, D = 0.109), and
for the successful and non-RfA distributions (p = 10^-4, D = 0.086). We cannot rule out the
possibility that the non-RfA and unsuccessful distributions are identical (p = 0.25, D = 0.042).
Closer examination of the tails of the distributions does not show any differences not already
explained by the variance. For the matched sample described above, the 95% confidence interval
for the ratio of the variances of successful and unsuccessful RfA distributions is [0.32, 0.43]; the
conclusions of the KS-tests are unchanged.
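
The two kinds of comparison used here can be illustrated on synthetic data. The sketch below
computes a variance ratio with a percentile bootstrap interval and the two-sample
Kolmogorov-Smirnov statistic D. The bootstrap is an assumption made for illustration (the text does
not say how the intervals were obtained), and the samples are random draws, not the paper's data.

```python
import random
from bisect import bisect_right


def sample_variance(xs):
    """Unbiased sample variance."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / (len(xs) - 1)


def variance_ratio(a, b):
    """Ratio of the sample variance of a to that of b."""
    return sample_variance(a) / sample_variance(b)


def bootstrap_ci(a, b, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the variance ratio (assumed method)."""
    rng = random.Random(seed)
    ratios = sorted(
        variance_ratio(rng.choices(a, k=len(a)), rng.choices(b, k=len(b)))
        for _ in range(n_boot)
    )
    lo = ratios[int((alpha / 2) * n_boot)]
    hi = ratios[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


def ks_statistic(a, b):
    """Two-sample KS statistic D: largest gap between the empirical CDFs."""
    sa, sb = sorted(a), sorted(b)
    d = 0.0
    for x in sa + sb:
        fa = bisect_right(sa, x) / len(sa)
        fb = bisect_right(sb, x) / len(sb)
        d = max(d, abs(fa - fb))
    return d


# Synthetic score changes: a "stable" group and a more "volatile" group.
rng = random.Random(1)
stable = [rng.gauss(0.0, 0.5) for _ in range(400)]
volatile = [rng.gauss(0.0, 1.0) for _ in range(400)]

ratio = variance_ratio(stable, volatile)
lo, hi = bootstrap_ci(stable, volatile)
d = ks_statistic(stable, volatile)
```

An interval for the ratio that lies well below 1 indicates that the first group changes less than
the second. In practice, a library routine such as SciPy's `scipy.stats.ks_2samp` would also supply
a p-value for D.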

5 Discussion 

This paper contributes to the literature in two different ways. First, we introduce new behavioral
metrics for quantifying controversial editing on Wikipedia. The measures we introduce can be
used (perhaps with domain-specific modifications) to triage suspicious behavior for deeper
investigation. Second, these measures allow us to contribute to the study of Wikipedia as an evolving
social system: along with showing that Wikipedia admins behave in a stable manner, we also
identify some intuitively surprising topics of conflict in Wikipedia.

The Controversy Score (C-Score) measures the extent to which an editor is influencing
controversial pages. The Clustered Controversy Score (CC-Score) builds on the C-Score, finding
single-topic editors of controversial pages. Both metrics are flexible, since they are language- and
platform-independent, and can work with different measures of controversy.

We validate the C- and CC-Scores as user-level measures of controversy on a set of blocked
users. On a set of administrators, we find several weaknesses of the C-Score: it misses
controversial, single-topic editors in the presence of a long tail of related, but less controversial, pages.
Further, the C-Score gives undue weight to unfocused vandalism fighting, a common behavior
among administrators. The CC-Score solves many of these issues, allowing us to find
controversial, single-topic editors, and it often finds editors who would not rank highly on the
C-Score alone. The CC-Score also enables us to identify interesting Wikipedia-specific
phenomena, for example, the substantial levels of controversy associated with some media- and
entertainment-specific pages as well as with some disambiguation pages.

We also show how the CC-Score can be used to analyze behavior changes, both for single
editors and in aggregate. We find several instances of dramatic shifts in behavior by administrators
upon assuming their responsibilities. At the same time, we show that administrators as a group
change their behavior significantly less than any other group of Wikipedians. This consistency
appears to be due to the role of administrator itself, rather than being a selection effect.

Future work. While we focus on editors working alone in this paper, an extension of the
CC-Score might highlight groups of editors influencing a single, controversial topic; this presents
interesting computational and evaluation challenges. Improvements to the CC-Score to better detect
manipulation might focus on natural language processing, or on non-local aspects of an editor's
behavior. Being platform-independent, the CC-Score is a useful tool for analyzing behavior in
general collective wisdom processes; we are interested in applications of the CC-Score to other
domains.